Adaptivity and Computation-Statistics Tradeoffs for Kernel and Distance based High Dimensional Two Sample Testing

Authors

  • Aaditya Ramdas
  • Sashank J. Reddi
  • Barnabás Póczos
  • Aarti Singh
  • Larry A. Wasserman
Abstract

Nonparametric two sample testing is a decision theoretic problem that involves identifying differences between two random variables without making parametric assumptions about their underlying distributions. We refer to the most common settings as mean difference alternatives (MDA), for testing differences only in first moments, and general difference alternatives (GDA), for testing for any difference in distributions. A large number of test statistics have been proposed for both these settings. This paper connects three classes of statistics: high dimensional variants of Hotelling’s t-test, statistics based on Reproducing Kernel Hilbert Spaces, and energy statistics based on pairwise distances. We ask the following question: how much statistical power do popular kernel and distance based tests for GDA have when the unknown distributions differ in their means, compared to specialized tests for MDA? To answer this, we formally characterize the power of popular tests for GDA like the Maximum Mean Discrepancy with the Gaussian kernel (gMMD) and bandwidth-dependent variants of the Energy Distance with the Euclidean norm (eED) in the high-dimensional MDA regime. We prove several interesting properties relating these classes of tests under MDA, which include (a) eED and gMMD have asymptotically equal power; furthermore, they enjoy a free lunch because, while they are additionally consistent for GDA, they have the same power as specialized high-dimensional t-tests for MDA. All these tests are asymptotically optimal (including matching constants) for MDA under spherical covariances, according to simple lower bounds. (b) The power of gMMD is independent of the kernel bandwidth, as long as it is larger than the choice made by the median heuristic. (c) There is a clear and smooth computation-statistics tradeoff for linear-time, subquadratic-time and quadratic-time versions of these tests, with more computation resulting in higher power. All three observations are practically important, since point (a) implies that eED and gMMD, while being consistent against all alternatives, are also automatically adaptive to simpler alternatives; point (b) suggests that the median “heuristic” has some theoretical justification for being a default bandwidth choice; and point (c) implies that expending more computation may yield direct statistical benefit by orders of magnitude.

arXiv:1508.00655v1 [math.ST], 4 Aug 2015
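
The statistics named in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the Gaussian kernel is parametrized here as exp(-||x - y||^2 / (2 bw^2)) (the paper may scale the bandwidth differently), the median heuristic is taken to be the median pairwise distance of the pooled sample, and the linear-time estimator averages the kernel over disjoint pairs, which is one common construction.

```python
import numpy as np

def sq_dists(a, b):
    """Pairwise squared Euclidean distances between rows of a and rows of b."""
    d = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.maximum(d, 0.0)  # clip tiny negatives caused by rounding

def median_bandwidth(x, y):
    """Median heuristic (assumed form): median pairwise distance of the pooled sample."""
    z = np.vstack([x, y])
    d = np.sqrt(sq_dists(z, z))
    return float(np.median(d[np.triu_indices(len(z), k=1)]))

def gmmd2_quadratic(x, y, bw):
    """Quadratic-time unbiased estimate of the squared Gaussian-kernel MMD."""
    n, m = len(x), len(y)
    kxx = np.exp(-sq_dists(x, x) / (2.0 * bw ** 2))
    kyy = np.exp(-sq_dists(y, y) / (2.0 * bw ** 2))
    kxy = np.exp(-sq_dists(x, y) / (2.0 * bw ** 2))
    np.fill_diagonal(kxx, 0.0)  # drop i == j terms for the U-statistic
    np.fill_diagonal(kyy, 0.0)
    return kxx.sum() / (n * (n - 1)) + kyy.sum() / (m * (m - 1)) - 2.0 * kxy.mean()

def gmmd2_linear(x, y, bw):
    """Linear-time estimate: average the kernel over disjoint consecutive pairs."""
    n = (min(len(x), len(y)) // 2) * 2
    x1, x2, y1, y2 = x[0:n:2], x[1:n:2], y[0:n:2], y[1:n:2]
    k = lambda a, b: np.exp(-((a - b) ** 2).sum(1) / (2.0 * bw ** 2))
    return float(np.mean(k(x1, x2) + k(y1, y2) - k(x1, y2) - k(x2, y1)))

def energy_distance(x, y):
    """Quadratic-time energy distance with the Euclidean norm (U-statistic form)."""
    n, m = len(x), len(y)
    dxx, dyy, dxy = (np.sqrt(sq_dists(a, b)) for a, b in ((x, x), (y, y), (x, y)))
    np.fill_diagonal(dxx, 0.0)
    np.fill_diagonal(dyy, 0.0)
    return 2.0 * dxy.mean() - dxx.sum() / (n * (n - 1)) - dyy.sum() / (m * (m - 1))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(200, 5))           # sample from N(0, I)
    y = rng.normal(size=(200, 5)) + 1.0     # mean-shifted sample (an MDA-style alternative)
    bw = median_bandwidth(x, y)
    print("bandwidth:", bw)
    print("gMMD^2 (quadratic):", gmmd2_quadratic(x, y, bw))
    print("gMMD^2 (linear):   ", gmmd2_linear(x, y, bw))
    print("eED:               ", energy_distance(x, y))
```

To turn any of these statistics into a test, one would still need a null threshold, for example from permuting the pooled sample; that step is omitted here. The linear-time version touches each point once and is noisier than the quadratic-time one, which is the kind of tradeoff point (c) of the abstract refers to.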

Similar resources

TESTING FOR “RANDOMNESS” IN SPATIAL POINT PATTERNS, USING TEST STATISTICS BASED ON ONE-DIMENSIONAL INTER-EVENT DISTANCES

To test for “randomness” in spatial point patterns, we propose two test statistics obtained by “reducing” two-dimensional point patterns to one-dimensional ones. The exact and asymptotic distributions of these statistics are also derived.

TESTING STATISTICAL HYPOTHESES UNDER FUZZY DATA AND BASED ON A NEW SIGNED DISTANCE

This paper deals with the problem of testing statistical hypotheses when the available data are fuzzy. In this approach, we first obtain a fuzzy test statistic based on fuzzy data, and then, based on a new signed distance between fuzzy numbers, we introduce a new decision rule to accept/reject the hypothesis of interest. The proposed approach is investigated for two cases: the case without nuisance p...

A robust least squares fuzzy regression model based on kernel function

In this paper, a new approach is presented to fit a robust fuzzy regression model based on some fuzzy quantities. In this approach, we first introduce a new distance between two fuzzy numbers using the kernel function, and then, based on the least squares method, the parameters of the fuzzy regression model are estimated. The proposed approach has a suitable performance to...

Equivalence of Distance-based and RKHS-based Statistics in Hypothesis Testing by Dino Sejdinovic, Bharath Sriperumbudur,

We provide a unifying framework linking two classes of statistics used in two-sample and independence testing: on the one hand, the energy distances and distance covariances from the statistics literature; on the other, maximum mean discrepancies (MMD), that is, distances between embeddings of distributions to reproducing kernel Hilbert spaces (RKHS), as established in machine learning. In the ...
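
The link between energy distances and MMD described in this abstract can be checked numerically on a small sample. The sketch below is an illustration under stated assumptions, not code from the paper: it uses the Euclidean distance, a kernel of the form k(a, b) = 0.5 * (|a - z0| + |b - z0| - |a - b|) built from that distance around an arbitrary centre point z0, and biased (V-statistic) estimates that keep the diagonal terms. Under those choices the energy distance equals twice the squared MMD. All function names are hypothetical.

```python
import numpy as np

def pairwise_dist(a, b):
    """Euclidean distances between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def energy_distance_v(x, y):
    """Biased (V-statistic) energy distance: 2 E|X-Y| - E|X-X'| - E|Y-Y'|."""
    return (2.0 * pairwise_dist(x, y).mean()
            - pairwise_dist(x, x).mean()
            - pairwise_dist(y, y).mean())

def distance_kernel(a, b, z0):
    """Kernel built from the distance, centred at z0 (assumed form, see lead-in)."""
    da = np.linalg.norm(a - z0, axis=1)[:, None]
    db = np.linalg.norm(b - z0, axis=1)[None, :]
    return 0.5 * (da + db - pairwise_dist(a, b))

def mmd2_v(x, y, z0):
    """Biased (V-statistic) squared MMD under the distance-built kernel."""
    return (distance_kernel(x, x, z0).mean()
            + distance_kernel(y, y, z0).mean()
            - 2.0 * distance_kernel(x, y, z0).mean())

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.normal(size=(30, 3))
    y = rng.normal(size=(40, 3)) + 0.5
    z0 = np.zeros(3)  # the centre point cancels out of the MMD
    print(energy_distance_v(x, y), 2.0 * mmd2_v(x, y, z0))
```

The terms involving z0 cancel when the three kernel averages are combined, which is why the choice of centre point does not affect the result.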

Journal:
  • CoRR

Volume: abs/1508.00655  Issue: -

Pages: -

Publication date: 2015